Using Seurat, we will re-cluster the data published in Farbehi et al. 2019. This study consists of single-cell RNA sequencing of murine cardiac tissue 3 and 7 days after myocardial infarction.
Farbehi et al. 2019 - https://elifesciences.org/articles/43882
Seurat - https://satijalab.org/seurat/
library(Seurat)
library(ggplot2)
library(dplyr)
library(patchwork)
library(limma)
library(readr)
library(readxl)
getwd()
## [1] "/Users/danieljgorski/sciebo/ag_fischer_scrnaseq/Single-cell/teaching/scRNAseq-practical-course/Lab 2 - Analysis of Farbehi et al 2019"
# setwd("path/to/your/data")
Here is an example of loading in data from a typical 10x Genomics cellranger pipeline, this is generally what you would recieve from a core facility after sequencing, alignment and count matrix generation.
pbmc.data <- Read10X(data.dir = "data/filtered_gene_bc_matrices/hg19/")
Create a Seurat object that includes genes that were found in a minimum of 3 cells and cells with a minimum of 200 genes.
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
To free up memory, you can remove these from the environment.
remove(pbmc.data)
remove(pbmc)
load("data/TIP_premade_object.Rdata")
The authors provided their count matrix data as a tab delimited text file (TIP_ShamVsMI_days3_7.txt). However, reading in and creating a Seurat object from a large (non-sparse) matrix requires a lot of memory. To speed things up we have prepared the seurat object before hand, which require less memory to read into R. If you would like to read in the published data yourself, de-comment the commands below.
# memory.limit(size = 20000)
# tip.data <- read.delim("data/TIP_ShamVsMI_days3_7.txt")
# TIP <- CreateSeuratObject(tip.data, min.cells = 10, min.genes = 200, project = "TIP")
# remove(tip.data)
The Seurat object we have created from the published data is called “TIP” (total interstitial population), if you simply enter this into your console, you will receieve the information on the object. Features are genes, samples are cells.
TIP
## An object of class Seurat
## 15675 features across 15073 samples within 1 assay
## Active assay: RNA (15675 features, 0 variable features)
dim(TIP) # get the dimensions of the object
## [1] 15675 15073
head(rownames(TIP)) # returns the first 5 rows (genes)
## [1] "Xkr4" "Sox17" "Mrpl15" "Lypla1" "Gm37988" "Tcea1"
head(colnames(TIP)) # returns the first 5 columns (barcodes/cells)
## [1] "AAACCTGAGAACAATC_Sham" "AAACCTGAGGCTCATT_Sham" "AAACCTGCAAGGACTG_Sham"
## [4] "AAACCTGCAGCCACCA_Sham" "AAACCTGGTCAAAGAT_Sham" "AAACCTGTCACTCTTA_Sham"
DefaultAssay(TIP) # find the current default assay
## [1] "RNA"
head(Idents(TIP)) # find the current identities
## AAACCTGAGAACAATC_Sham AAACCTGAGGCTCATT_Sham AAACCTGCAAGGACTG_Sham
## TIP TIP TIP
## AAACCTGCAGCCACCA_Sham AAACCTGGTCAAAGAT_Sham AAACCTGTCACTCTTA_Sham
## TIP TIP TIP
## Levels: TIP
The metadata slot is a useful place to store cell-level information and can be accessed multiple ways.
head(TIP@meta.data)
## orig.ident nCount_RNA nFeature_RNA
## AAACCTGAGAACAATC_Sham TIP 6502 2378
## AAACCTGAGGCTCATT_Sham TIP 12291 3328
## AAACCTGCAAGGACTG_Sham TIP 8931 2940
## AAACCTGCAGCCACCA_Sham TIP 2426 1315
## AAACCTGGTCAAAGAT_Sham TIP 4453 1281
## AAACCTGTCACTCTTA_Sham TIP 4303 1776
head(TIP[[]])
## orig.ident nCount_RNA nFeature_RNA
## AAACCTGAGAACAATC_Sham TIP 6502 2378
## AAACCTGAGGCTCATT_Sham TIP 12291 3328
## AAACCTGCAAGGACTG_Sham TIP 8931 2940
## AAACCTGCAGCCACCA_Sham TIP 2426 1315
## AAACCTGGTCAAAGAT_Sham TIP 4453 1281
## AAACCTGTCACTCTTA_Sham TIP 4303 1776
tail(TIP@meta.data)
## orig.ident nCount_RNA nFeature_RNA
## TTTGGTTTCGACCAGC_MI_day7 TIP 9147 3271
## TTTGTCACAACAACCT_MI_day7 TIP 7197 2338
## TTTGTCACATAGACTC_MI_day7 TIP 7653 2001
## TTTGTCAGTAGCTAAA_MI_day7 TIP 9873 2842
## TTTGTCAGTGAAATCA_MI_day7 TIP 8468 2803
## TTTGTCATCGTCTGAA_MI_day7 TIP 11990 2626
You can see the cell barcodes contain information about which timepoint/surgery they belong to. With the following code we extract that pattern, and add it to a new column in the metadata titled “Timepoint”.
TIP@meta.data$Timepoint <- ifelse(grepl("Sham", rownames(TIP@meta.data)), "Sham",
ifelse(grepl("MI_day3", rownames(TIP@meta.data)), "MI_day3",
ifelse(grepl("MI_day7", rownames(TIP@meta.data)),
"MI_day7", "NA")))
head(TIP@meta.data)
## orig.ident nCount_RNA nFeature_RNA Timepoint
## AAACCTGAGAACAATC_Sham TIP 6502 2378 Sham
## AAACCTGAGGCTCATT_Sham TIP 12291 3328 Sham
## AAACCTGCAAGGACTG_Sham TIP 8931 2940 Sham
## AAACCTGCAGCCACCA_Sham TIP 2426 1315 Sham
## AAACCTGGTCAAAGAT_Sham TIP 4453 1281 Sham
## AAACCTGTCACTCTTA_Sham TIP 4303 1776 Sham
unique(TIP@meta.data$Timepoint)
## [1] "Sham" "MI_day3" "MI_day7"
table(TIP@meta.data$Timepoint) # see how many cells are in each group
##
## MI_day3 MI_day7 Sham
## 4073 4399 6601
Add the percentage of counts mapped to the mitochondrial genome to the metadata
TIP[["percent.mt"]] <- PercentageFeatureSet(TIP, pattern = "^mt-")
head(TIP@meta.data)
## orig.ident nCount_RNA nFeature_RNA Timepoint percent.mt
## AAACCTGAGAACAATC_Sham TIP 6502 2378 Sham 3.445094
## AAACCTGAGGCTCATT_Sham TIP 12291 3328 Sham 1.708567
## AAACCTGCAAGGACTG_Sham TIP 8931 2940 Sham 3.112753
## AAACCTGCAGCCACCA_Sham TIP 2426 1315 Sham 1.648805
## AAACCTGGTCAAAGAT_Sham TIP 4453 1281 Sham 2.335504
## AAACCTGTCACTCTTA_Sham TIP 4303 1776 Sham 2.974669
Note how many cells we have before filtering.
Cells_preQC <- length(colnames(TIP))
Cells_preQC
## [1] 15073
Plot the number of genes detected, counts and percentage of mitochondrial reads.
VlnPlot(TIP, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
Visualize low quality cells and doublets (outliers).
p1 <- FeatureScatter(TIP, feature1= "nCount_RNA", feature2= "percent.mt", cols = "black") + NoLegend() + xlab("nUMI") + ylab("%mito")
p2 <- FeatureScatter(TIP, feature1= "nCount_RNA", feature2= "nFeature_RNA", cols = "black") + NoLegend() + xlab("nUMI") + ylab("nGene")
p1 + p2
Set the “group.by” argument to a metadata column to separate the data further.
VlnPlot(TIP, features = "nFeature_RNA", group.by = "Timepoint")
You’ll notice the order of the timepoints is not ideal. This is a common problem when visualizing results. To remedy this, we can re-order the factor levels of the data in question.
TIP@meta.data$Timepoint <- factor(TIP@meta.data$Timepoint, levels = c("Sham", "MI_day3", "MI_day7"))
VlnPlot(TIP, features = "nFeature_RNA", group.by = "Timepoint")
Vizualize the number of genes detected, counts and mitochondrial reads across timepoints.
VlnPlot(TIP, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),group.by = "Timepoint", ncol = 3)
To avoid including stressed/dying cells or doublets in the downstream clustering, we subset the object to include cells that have more than 200 unique genes, but no more than 4000, as well as fewer than 5% of the reads mapped to the mitochondrial genome. These are typical cutoffs, but you should always evaluate if they make sense for your dataset. Filtering out cells with expression of more than 4000 unique genes is a quick way to remove possible doublets. It should be noted there are more elegant ways of removing doublets (e.g. DoubletFinder), but are outside the scope of this tutorial.
TIP <- subset(TIP, subset = nFeature_RNA > 200 & nFeature_RNA < 4000 & percent.mt < 5)
VlnPlot(TIP, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), group.by = "Timepoint", ncol = 3)
Then we can evaluate how many cells were kept. This is good information to see whether or not your filters were too harsh, or whether your data set contains a high amount of poor quality cells.
dim(TIP)
## [1] 15675 13867
Cells_postQC <- length(colnames(TIP))
(Cells_postQC/Cells_preQC)*100
## [1] 91.99894
SCTransform()will normalize, scale, and select highly
variable features. Upon completion it will create a corrected count
matrix that can be used for PCA.
TIP <- SCTransform(TIP, verbose = F)
TIP # active assay will have changed to SCT
## An object of class Seurat
## 31350 features across 13867 samples within 2 assays
## Active assay: SCT (15675 features, 3000 variable features)
## 1 other assay present: RNA
head(TIP@meta.data) # SCTransform creates corrected values for nCount and nFeature
## orig.ident nCount_RNA nFeature_RNA Timepoint percent.mt
## AAACCTGAGAACAATC_Sham TIP 6502 2378 Sham 3.445094
## AAACCTGAGGCTCATT_Sham TIP 12291 3328 Sham 1.708567
## AAACCTGCAAGGACTG_Sham TIP 8931 2940 Sham 3.112753
## AAACCTGCAGCCACCA_Sham TIP 2426 1315 Sham 1.648805
## AAACCTGGTCAAAGAT_Sham TIP 4453 1281 Sham 2.335504
## AAACCTGTCACTCTTA_Sham TIP 4303 1776 Sham 2.974669
## nCount_SCT nFeature_SCT
## AAACCTGAGAACAATC_Sham 5561 2378
## AAACCTGAGGCTCATT_Sham 5893 2611
## AAACCTGCAAGGACTG_Sham 5833 2851
## AAACCTGCAGCCACCA_Sham 4221 1381
## AAACCTGGTCAAAGAT_Sham 4795 1281
## AAACCTGTCACTCTTA_Sham 4660 1776
We then plot the variable genes, and label the top 15.
LabelPoints(plot = VariableFeaturePlot(TIP), points = head(VariableFeatures(TIP), 15), repel = TRUE)
Now we use principle components analysis to reduce the dimensionality of the data.
TIP <- RunPCA(TIP, npcs = 60)
Explore the PC loadings and embeddings. Loadings are the weights that transform predictors (i.e. genes) into a component. Embeddings are the coordinates that each cell takes on for each component.
Loadings(TIP, reduction = "pca")[1:5,1:5]
## PC_1 PC_2 PC_3 PC_4 PC_5
## Cd74 -0.12562347 0.006683773 -0.33863474 0.16793703 -0.018331547
## Spp1 -0.10031563 0.037730349 0.12164740 -0.01224196 -0.011496392
## Fabp4 0.06645396 -0.331609607 0.09651205 0.06778372 -0.005150919
## Gsn 0.18651272 0.184699320 0.04073824 0.07605581 0.291644429
## S100a8 -0.02096756 0.003095870 0.01814611 -0.06582093 0.011167362
VizDimLoadings(TIP, dims = 1:2, reduction = "pca") # shows loading of top ranked genes for each PC
Embeddings(TIP, reduction = "pca") [1:5, 1:5]
## PC_1 PC_2 PC_3 PC_4 PC_5
## AAACCTGAGAACAATC_Sham 31.5044267 19.063463 1.710218 1.309856 16.779891
## AAACCTGAGGCTCATT_Sham -27.4304570 -8.237995 -17.981295 25.729781 -1.990841
## AAACCTGCAAGGACTG_Sham 39.7891708 26.113364 5.101666 2.339711 19.105828
## AAACCTGCAGCCACCA_Sham 14.6134574 -7.509563 1.219115 -5.195359 -7.164749
## AAACCTGGTCAAAGAT_Sham 0.1911823 -3.659504 -27.723117 -22.118633 1.076447
Finding the first cell, PC1=-31.88, PC2=-18.87.
LabelPoints(plot = DimPlot(TIP, reduction = "pca", dims = 1:2), points = "AAACCTGAGAACAATC_Sham", repel = T, xnudge = -15, ynudge = 30)
## When using repel, set xnudge and ynudge to 0 for optimal results
To determine how many PCs to use for clustering, we can evaluate a scree plot. As the curve levels off, the PCs are no longer capturing useful variation. Ideally we choose the number of principle components at the break of the elbow. This can be subjective, try multiple iterations of clustering with different numbers of PCs. Another way to determine if PCs are capturing variation is to plot them against each other, if a spherical shape forms, the PCs are likely not capturing variation, and are not useful.
ElbowPlot(TIP, ndims = 60) # Can also use JackStraw() to compute number of "significant" PCs
p1 <- DimPlot(TIP, reduction = "pca", dims = 10:11) + NoLegend()
p2 <- DimPlot(TIP, reduction = "pca", dims = 20:21) + NoLegend()
p3 <- DimPlot(TIP, reduction = "pca", dims = 30:31) + NoLegend()
p4 <- DimPlot(TIP, reduction = "pca", dims = 40:41) + NoLegend()
p5 <- DimPlot(TIP, reduction = "pca", dims = 50:51) + NoLegend()
p6 <- DimPlot(TIP, reduction = "pca", dims = 59:60) + NoLegend()
par(mfrow= c(3,3))
p1 + p2 + p3 + p4 + p5 + p6
Heatmaps are also useful, here we select only 500 cells from the dataset and display scores for a range of PCs.
DimHeatmap(TIP, dims = 1:10, cells = 500, balanced = T)
DimHeatmap(TIP, dims = 30:40, cells = 500, balanced = T) #33rd PC seems to be reasonable cut off
DimHeatmap(TIP, dims = 50:60, cells = 500, balanced = T)
RunUMAP() runs the Uniform Manifold Approximation and
Projection (UMAP) dimensional reduction technique. This helps us
visualize our data in 2D. By running DimPlot() we can plot
our cells, colored by their current identity class.
TIP <- RunUMAP(TIP, reduction = "pca", dims = 1:33, verbose=F) # Use 33 PCs as determined above "dims=1:33"
DimPlot(TIP)
Explore the embeddings and locate the first cell.
head(Embeddings(TIP, reduction = "umap"))
## UMAP_1 UMAP_2
## AAACCTGAGAACAATC_Sham -10.573415 -3.1954724
## AAACCTGAGGCTCATT_Sham 8.436336 -0.4119199
## AAACCTGCAAGGACTG_Sham -10.761519 -3.6531009
## AAACCTGCAGCCACCA_Sham -7.807541 7.4030156
## AAACCTGGTCAAAGAT_Sham 2.187853 3.4544909
## AAACCTGTCACTCTTA_Sham -10.606962 0.5445584
LabelPoints(plot = DimPlot(TIP), points = "AAACCTGAGAACAATC_Sham", repel = T, xnudge = -15, ynudge = 30)
## When using repel, set xnudge and ynudge to 0 for optimal results
To cluster the data, we use the same number of PCs we used for the UMAP embedding. Be mindful of the resolution parameter, this will dramatically effect the number of clusters that are generated. It’s wise to explore a number of different clustering iterations using different resolutions. Be sure to verify they actually represent unique cell identities!
TIP <- FindNeighbors(TIP, dims = 1:33, verbose = F)
TIP <- FindClusters(TIP, resolution = 0.5, verbose = F)
The new default identity for the Seurat object will be set to most
recent FindClusters() result. This is also
“seurat_clusters” in the metadata.
head(TIP@meta.data)
## orig.ident nCount_RNA nFeature_RNA Timepoint percent.mt
## AAACCTGAGAACAATC_Sham TIP 6502 2378 Sham 3.445094
## AAACCTGAGGCTCATT_Sham TIP 12291 3328 Sham 1.708567
## AAACCTGCAAGGACTG_Sham TIP 8931 2940 Sham 3.112753
## AAACCTGCAGCCACCA_Sham TIP 2426 1315 Sham 1.648805
## AAACCTGGTCAAAGAT_Sham TIP 4453 1281 Sham 2.335504
## AAACCTGTCACTCTTA_Sham TIP 4303 1776 Sham 2.974669
## nCount_SCT nFeature_SCT SCT_snn_res.0.5 seurat_clusters
## AAACCTGAGAACAATC_Sham 5561 2378 6 6
## AAACCTGAGGCTCATT_Sham 5893 2611 3 3
## AAACCTGCAAGGACTG_Sham 5833 2851 6 6
## AAACCTGCAGCCACCA_Sham 4221 1381 12 12
## AAACCTGGTCAAAGAT_Sham 4795 1281 8 8
## AAACCTGTCACTCTTA_Sham 4660 1776 0 0
DimPlot(TIP, label = T)
head(Idents(TIP)) # To see active ident, or use TIP@active.ident
## AAACCTGAGAACAATC_Sham AAACCTGAGGCTCATT_Sham AAACCTGCAAGGACTG_Sham
## 6 3 6
## AAACCTGCAGCCACCA_Sham AAACCTGGTCAAAGAT_Sham AAACCTGTCACTCTTA_Sham
## 12 8 0
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
It is unclear whether cluster markers should be identified using the SCT or RNA assay, however since it is a type of differential gene expression analysis. I believe it is safer to preform cluster marker identification on the RNA assay, based on the following references.
https://www.embopress.org/doi/full/10.15252/msb.20188746
https://github.com/satijalab/seurat/discussions/4032
First, we have to normalize and scale the data in the RNA assay before we can use it.
DefaultAssay(TIP) <- "RNA" #changing the default assay to "RNA" assay
TIP <- NormalizeData(TIP, normalization.method = "LogNormalize", scale.factor = 10000, verbose = T)
all_genes <- rownames(TIP)
TIP <- ScaleData(TIP, features = all_genes, verbose = T)
## Centering and scaling data matrix
Identify cluster markers. Note when the only.pos
argument is set to TRUE, it will only return positive marker genes. This
is recommended for finding cluster biomarkers, but for normal
differential gene expression analysis, you would want to find both
positive and negative DEG. In that case, you would set
only.pos = F.
cluster_markers <- FindAllMarkers(TIP,
assay = "RNA",
verbose = F,
only.pos = T,
base = 2)
head(cluster_markers)
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## Gsn 0 3.562320 1.000 0.764 0 0 Gsn
## Dcn 0 3.422902 1.000 0.522 0 0 Dcn
## Smoc2 0 3.060662 0.945 0.176 0 0 Smoc2
## Gpx3 0 2.985357 0.983 0.256 0 0 Gpx3
## Htra3 0 2.776535 0.986 0.234 0 0 Htra3
## Dpep1 0 2.757173 0.885 0.133 0 0 Dpep1
write.csv(cluster_markers, file = "results/cluster_markers.csv", row.names = F) # saves our results
Show the top 2 cluster markers.
top2 <- cluster_markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
top2
## # A tibble: 44 × 7
## # Groups: cluster [22]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 0 3.56 1 0.764 0 0 Gsn
## 2 0 3.42 1 0.522 0 0 Dcn
## 3 0 4.66 0.894 0.185 0 1 Spp1
## 4 0 3.77 0.587 0.031 0 1 Arg1
## 5 0 4.19 0.997 0.636 0 2 Fabp4
## 6 0 3.62 0.784 0.155 0 2 Aqp1
## 7 0 3.59 0.997 0.285 0 3 H2-Eb1
## 8 0 3.52 1 0.338 0 3 H2-Aa
## 9 0 4.52 0.995 0.034 0 4 Cd79a
## 10 0 4.11 0.989 0.065 0 4 Cd79b
## # … with 34 more rows
DimPlot(TIP, label = T)
DimPlot(TIP, group.by = "Timepoint")
FeaturePlot(TIP, features = "Col1a1")
FeaturePlot(TIP, features = c("Pecam1", "Ptprc", "Col1a1"))
VlnPlot(TIP, features = "Adgre1")
Because the number of clusters generated is dependent on the
resolution parameter in the FindCluster() function, it is
important to check their validity as transcriptionally unique cell
identities. One way of checking this is to visualize your cluster marker
genes in a heatmap.
top30 <- cluster_markers %>% group_by(cluster) %>% top_n(n = 30, wt = avg_log2FC)
DoHeatmap(TIP, features = top30$gene) + NoLegend()
Check for clusters dominated by high mitochondrial reads or low gene expression
head(TIP@meta.data)
## orig.ident nCount_RNA nFeature_RNA Timepoint percent.mt
## AAACCTGAGAACAATC_Sham TIP 6502 2378 Sham 3.445094
## AAACCTGAGGCTCATT_Sham TIP 12291 3328 Sham 1.708567
## AAACCTGCAAGGACTG_Sham TIP 8931 2940 Sham 3.112753
## AAACCTGCAGCCACCA_Sham TIP 2426 1315 Sham 1.648805
## AAACCTGGTCAAAGAT_Sham TIP 4453 1281 Sham 2.335504
## AAACCTGTCACTCTTA_Sham TIP 4303 1776 Sham 2.974669
## nCount_SCT nFeature_SCT SCT_snn_res.0.5 seurat_clusters
## AAACCTGAGAACAATC_Sham 5561 2378 6 6
## AAACCTGAGGCTCATT_Sham 5893 2611 3 3
## AAACCTGCAAGGACTG_Sham 5833 2851 6 6
## AAACCTGCAGCCACCA_Sham 4221 1381 12 12
## AAACCTGGTCAAAGAT_Sham 4795 1281 8 8
## AAACCTGTCACTCTTA_Sham 4660 1776 0 0
FeaturePlot(TIP, features = "percent.mt", label = T)
VlnPlot(TIP, features = "percent.mt")
VlnPlot(TIP, features = "nFeature_RNA")
VlnPlot(TIP, features = "nCount_RNA")
average_expression <- AverageExpression(TIP, assays = "RNA")
write.csv(average_expression$RNA, file = "results/average_expression.csv") # save your results
DEG analysis can be performed on any set of groups you chose. Below we find differentially expressed genes between clusters 5 and 0.
Cluster_0_v_5 <- FindMarkers(TIP, ident.1 = "5", ident.2 = "0")
head(Cluster_0_v_5, n = 20)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## Ptpn18 0 2.263706 0.793 0.076 0
## Col3a1 0 -3.666539 0.136 0.983 0
## Stk17b 0 1.884147 0.676 0.076 0
## Arpc2 0 1.969499 0.981 0.802 0
## Ptprc 0 2.592930 0.829 0.039 0
## Rgs2 0 2.272103 0.667 0.064 0
## Ncf2 0 2.296407 0.762 0.023 0
## Dpt 0 -3.152083 0.045 0.962 0
## Fcgr2b 0 1.700231 0.537 0.021 0
## Fcgr3 0 2.696390 0.819 0.040 0
## Fcer1g 0 4.979380 1.000 0.163 0
## Arhgap30 0 1.502835 0.512 0.013 0
## Cd244 0 1.540289 0.484 0.004 0
## Cd48 0 1.657789 0.537 0.022 0
## H3f3a 0 1.620468 0.997 0.964 0
## Cd34 0 -3.076266 0.080 0.956 0
## Entpd2 0 -2.562172 0.024 0.897 0
## Gsn 0 -5.932477 0.589 1.000 0
## Cytip 0 2.101450 0.671 0.023 0
## Serping1 0 -3.979222 0.054 0.987 0
VlnPlot(TIP, features = "Gsn", idents = c("0", "5"))
VlnPlot(TIP, features = "Postn", idents = c("0", "5"))
FeaturePlot(TIP, features = "Postn", label = T)
We also have different treatments, lets use the “Timepoint” column in the metadata to calculate differentially expressed genes between MI and Sham in cluster 4.
head(TIP@meta.data)
## orig.ident nCount_RNA nFeature_RNA Timepoint percent.mt
## AAACCTGAGAACAATC_Sham TIP 6502 2378 Sham 3.445094
## AAACCTGAGGCTCATT_Sham TIP 12291 3328 Sham 1.708567
## AAACCTGCAAGGACTG_Sham TIP 8931 2940 Sham 3.112753
## AAACCTGCAGCCACCA_Sham TIP 2426 1315 Sham 1.648805
## AAACCTGGTCAAAGAT_Sham TIP 4453 1281 Sham 2.335504
## AAACCTGTCACTCTTA_Sham TIP 4303 1776 Sham 2.974669
## nCount_SCT nFeature_SCT SCT_snn_res.0.5 seurat_clusters
## AAACCTGAGAACAATC_Sham 5561 2378 6 6
## AAACCTGAGGCTCATT_Sham 5893 2611 3 3
## AAACCTGCAAGGACTG_Sham 5833 2851 6 6
## AAACCTGCAGCCACCA_Sham 4221 1381 12 12
## AAACCTGGTCAAAGAT_Sham 4795 1281 8 8
## AAACCTGTCACTCTTA_Sham 4660 1776 0 0
Cluster_4_MI_v_Sham <- FindMarkers(TIP,
subset.ident = "4",
group.by = "Timepoint",
ident.1 = c("MI_day7", "MI_day3"),
ident.2 = "Sham")
head(Cluster_4_MI_v_Sham, n = 20)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## Gsn 5.874734e-99 -2.4010838 0.403 0.956 9.208645e-95
## Rpl10-ps3 6.089405e-92 2.0075491 0.974 0.493 9.545142e-88
## Rps29 6.901688e-80 -0.6996789 1.000 1.000 1.081840e-75
## Rpl9-ps6 7.912200e-79 1.7585290 0.967 0.551 1.240237e-74
## Rps18-ps3 7.651519e-73 1.7419248 0.918 0.364 1.199376e-68
## Wdr89 1.887296e-68 1.2767161 0.987 0.776 2.958336e-64
## Gm9843 6.900207e-68 1.2894093 0.987 0.916 1.081607e-63
## Dcn 2.134981e-55 -1.7693616 0.318 0.793 3.346583e-51
## Gm11808 4.798430e-55 1.2660164 0.964 0.691 7.521540e-51
## Rpl27-ps3 1.328229e-54 1.3931717 0.905 0.447 2.081998e-50
## Rpl38 3.329097e-50 -0.6368921 1.000 1.000 5.218359e-46
## Gm8730 4.024229e-50 1.2858601 0.931 0.624 6.307979e-46
## Rpl23a-ps3 5.351983e-48 0.9988610 0.980 0.851 8.389234e-44
## Rps28 1.110127e-45 -0.5475573 1.000 1.000 1.740125e-41
## Rpl7a-ps5 4.601099e-43 1.3698474 0.593 0.118 7.212222e-39
## Uba52 4.308892e-41 -1.4209416 0.466 0.782 6.754189e-37
## Gm10076 1.275596e-36 -1.1161386 0.620 0.842 1.999497e-32
## Rpl13-ps3 1.651672e-36 1.1976014 0.767 0.369 2.588996e-32
## Rpl13a-ps1 1.932085e-36 1.2490410 0.584 0.140 3.028544e-32
## Rps27 2.339002e-36 -0.3786710 1.000 1.000 3.666386e-32
VlnPlot(TIP, idents = "4", features = "Dcn", group.by = "Timepoint")
Another useful method is to plot several genes that are a part of a
common topic or “signature”. For example, we can read in a list of genes
that make up the gene ontology term “Extracellular Matrix”. And score
each cell on the aggregate expression of this list of genes with
AddModuleScore().
Gene_ontology <- read_excel("data/gene_signatures.xlsx", sheet = "Gene_ontology")
head(Gene_ontology$`extracellular_matrix_GO:0031012`)
## [1] "Prelp" "Prelp" "Prelp" "Prelp" "Prelp" "Rarres2"
ECM_genes <- unique(Gene_ontology$`extracellular_matrix_GO:0031012`)
TIP <- AddModuleScore(TIP,
features = list(ECM_genes),
ctrl = 50,
name = "ECM")
## Warning: The following features are not present in the object: Zp3, Gm7534,
## Ush2a, Adipoq, Gpc2, Alb, Mmp24, Frem3, Pzp, Zp2, Cpn2, Gm5771, Fibcd1, Tnn,
## Hpse2, Muc2, Muc4, Gh, Otol1, Col6a4, Snorc, Ntn5, Phospho1, Mmp10, Ccn1, Mmp1b,
## Amtn, Prss1, Otogl, Vwa2, Cela2a, Frem2, Fga, Insl5, Egfl6, Ccn2, Fgg, Mmp7,
## Hrg, Ntn3, Muc5b, Rtbdn, Ccn4, Bcan, Zp1, Zpld1, Itln1, Muc6, Lama1, Col19a1,
## Lingo2, Adamts18, Dmbt1, Fgb, Enam, Dspp, Gpc5, Kng2, Zan, Coch, Umodl1,
## 9530053A07Rik, Gp2, Ccn6, Adamts13, Otog, Reg3b, Plg, Lingo3, Tgm4, Reg2, Reg1,
## Rbp3, Wnt7a, Wnt3, Matn1, Lgr6, Ambn, Serpina3k, Ihh, Sspo, Dmp1, Gp1ba, Prss3,
## Try4, Reg3g, Mepe, Zg16, Hapln2, 2300002M23Rik, Mbl2, Sftpa1, Try10, Oc90,
## Krt1, Itih1, Lingo4, Rtn4rl2, Kng1, Ins2, Ins1, Umod, Lad1, Hapln4, Cela3b, Shh,
## Muc5ac, Ncan, Tectb, Sftpb, Sftpd, Serpini2, Fcgbp, Fgl1, Tecta, Elfn2, Lrrtm4,
## Lrrtm1, Lrrc3c, Mmp21, Spock3, Ndp, Amelx, Serpina1a, Ccl21c, Ccn5, Ti, Col10a1,
## Tinag, Lrrn3, Impg2, F2, Mmp20, Ccn3, NA, not searching for symbol synonyms
head(TIP@meta.data) # this new gene signature is written into the metadata, by default there is a number added (ECM1)
## orig.ident nCount_RNA nFeature_RNA Timepoint percent.mt
## AAACCTGAGAACAATC_Sham TIP 6502 2378 Sham 3.445094
## AAACCTGAGGCTCATT_Sham TIP 12291 3328 Sham 1.708567
## AAACCTGCAAGGACTG_Sham TIP 8931 2940 Sham 3.112753
## AAACCTGCAGCCACCA_Sham TIP 2426 1315 Sham 1.648805
## AAACCTGGTCAAAGAT_Sham TIP 4453 1281 Sham 2.335504
## AAACCTGTCACTCTTA_Sham TIP 4303 1776 Sham 2.974669
## nCount_SCT nFeature_SCT SCT_snn_res.0.5 seurat_clusters
## AAACCTGAGAACAATC_Sham 5561 2378 6 6
## AAACCTGAGGCTCATT_Sham 5893 2611 3 3
## AAACCTGCAAGGACTG_Sham 5833 2851 6 6
## AAACCTGCAGCCACCA_Sham 4221 1381 12 12
## AAACCTGGTCAAAGAT_Sham 4795 1281 8 8
## AAACCTGTCACTCTTA_Sham 4660 1776 0 0
## ECM1
## AAACCTGAGAACAATC_Sham 0.31091072
## AAACCTGAGGCTCATT_Sham 0.01829678
## AAACCTGCAAGGACTG_Sham 0.37171388
## AAACCTGCAGCCACCA_Sham 0.04132743
## AAACCTGGTCAAAGAT_Sham -0.04088970
## AAACCTGTCACTCTTA_Sham 0.25029153
FeaturePlot(TIP, features = "ECM1") + ggtitle("Extracellular matrix signature")
Often an interesting question is whether or not a treatment or timepoint changes the abundance of a certain cell identitiy. Here is a way to tabulate the cluster compositions. Statistically testing these for significant changes is another issue. One such approach, termed “Differential proportion analysis” is described in the methods section of Farbehi et al. 2019, I encourage you to read it.
head(Idents(TIP)) # check your current default identity
## AAACCTGAGAACAATC_Sham AAACCTGAGGCTCATT_Sham AAACCTGCAAGGACTG_Sham
## 6 3 6
## AAACCTGCAGCCACCA_Sham AAACCTGGTCAAAGAT_Sham AAACCTGTCACTCTTA_Sham
## 12 8 0
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
table(Idents(TIP), TIP$Timepoint) # number of cells in each cluster across timepoints
##
## Sham MI_day3 MI_day7
## 0 2115 137 573
## 1 22 2130 222
## 2 1476 257 612
## 3 185 258 466
## 4 450 27 278
## 5 77 529 129
## 6 514 24 160
## 7 43 97 520
## 8 274 43 237
## 9 79 26 307
## 10 25 246 92
## 11 195 24 83
## 12 183 9 36
## 13 28 101 37
## 14 30 16 84
## 15 92 7 17
## 16 82 1 3
## 17 51 2 28
## 18 28 2 22
## 19 14 0 21
## 20 18 1 5
## 21 8 3 6
(prop.table(table(Idents(TIP), TIP$Timepoint), margin = 2))*100
##
## Sham MI_day3 MI_day7
## 0 35.31474370 3.47715736 14.55053327
## 1 0.36734012 54.06091371 5.63737938
## 2 24.64518284 6.52284264 15.54088370
## 3 3.08899649 6.54822335 11.83341798
## 4 7.51377525 0.68527919 7.05942103
## 5 1.28569043 13.42639594 3.27577450
## 6 8.58240107 0.60913706 4.06297613
## 7 0.71798297 2.46192893 13.20467242
## 8 4.57505427 1.09137056 6.01828339
## 9 1.31908499 0.65989848 7.79583545
## 10 0.41743196 6.24365482 2.33621127
## 11 3.25596928 0.60913706 2.10766887
## 12 3.05560194 0.22842640 0.91416963
## 13 0.46752379 2.56345178 0.93956323
## 14 0.50091835 0.40609137 2.13306247
## 15 1.53614961 0.17766497 0.43169121
## 16 1.36917682 0.02538071 0.07618080
## 17 0.85156120 0.05076142 0.71102082
## 18 0.46752379 0.05076142 0.55865922
## 19 0.23376190 0.00000000 0.53326562
## 20 0.30055101 0.02538071 0.12696800
## 21 0.13357823 0.07614213 0.15236160
cluster_composition <- (prop.table(table(Idents(TIP), TIP$Timepoint), margin = 2))*100
cluster_composition <- as.data.frame(cluster_composition)
colnames(cluster_composition) <- c("Cluster", "Timepoint", "Percent")
head(cluster_composition)
## Cluster Timepoint Percent
## 1 0 Sham 35.3147437
## 2 1 Sham 0.3673401
## 3 2 Sham 24.6451828
## 4 3 Sham 3.0889965
## 5 4 Sham 7.5137753
## 6 5 Sham 1.2856904
write.csv(cluster_composition, file = "results/cluster_composition.csv", row.names = F)
We can annotate the dataset by renaming the identities and storing them in the metadata. For now we will use German cities, but real cluster annotation requires you to look into the expression profile of each cluster to infer a cellular “identity”.
head(Idents(TIP)) # be sure the current default identity is the cluster numbers you wish to rename
## AAACCTGAGAACAATC_Sham AAACCTGAGGCTCATT_Sham AAACCTGCAAGGACTG_Sham
## 6 3 6
## AAACCTGCAGCCACCA_Sham AAACCTGGTCAAAGAT_Sham AAACCTGTCACTCTTA_Sham
## 12 8 0
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Idents(TIP) <- "seurat_clusters" # set the identity to seurat_clusters if not
DimPlot(TIP, label = T)
TIP <- RenameIdents(TIP,
"0"="Berlin",
"1"="Hamburg",
"2"="München",
"3"="Köln",
"4"="Frankfurt",
"5"="Stuttgart",
"6"="Düsseldorf",
"7"="Dortmund",
"8"="Essen",
"9"="Leipzig",
"10"="Bremen",
"11"="Dresden",
"12"="Hannover",
"13"="Nuremberg",
"14"="Duisburg",
"15"="Bochum",
"16"="Wuppertal",
"17"="Bielefeld",
"18"="Bonn",
"19"="Münster")
DimPlot(TIP, label = T)
head(TIP@meta.data) # New identities are only temporary, they are not written into metadata
## orig.ident nCount_RNA nFeature_RNA Timepoint percent.mt
## AAACCTGAGAACAATC_Sham TIP 6502 2378 Sham 3.445094
## AAACCTGAGGCTCATT_Sham TIP 12291 3328 Sham 1.708567
## AAACCTGCAAGGACTG_Sham TIP 8931 2940 Sham 3.112753
## AAACCTGCAGCCACCA_Sham TIP 2426 1315 Sham 1.648805
## AAACCTGGTCAAAGAT_Sham TIP 4453 1281 Sham 2.335504
## AAACCTGTCACTCTTA_Sham TIP 4303 1776 Sham 2.974669
## nCount_SCT nFeature_SCT SCT_snn_res.0.5 seurat_clusters
## AAACCTGAGAACAATC_Sham 5561 2378 6 6
## AAACCTGAGGCTCATT_Sham 5893 2611 3 3
## AAACCTGCAAGGACTG_Sham 5833 2851 6 6
## AAACCTGCAGCCACCA_Sham 4221 1381 12 12
## AAACCTGGTCAAAGAT_Sham 4795 1281 8 8
## AAACCTGTCACTCTTA_Sham 4660 1776 0 0
## ECM1
## AAACCTGAGAACAATC_Sham 0.31091072
## AAACCTGAGGCTCATT_Sham 0.01829678
## AAACCTGCAAGGACTG_Sham 0.37171388
## AAACCTGCAGCCACCA_Sham 0.04132743
## AAACCTGGTCAAAGAT_Sham -0.04088970
## AAACCTGTCACTCTTA_Sham 0.25029153
TIP@meta.data$annotation_1 <- (Idents(TIP)) # save new identities as "annotation_1" column in metadata
head(TIP@meta.data)
## orig.ident nCount_RNA nFeature_RNA Timepoint percent.mt
## AAACCTGAGAACAATC_Sham TIP 6502 2378 Sham 3.445094
## AAACCTGAGGCTCATT_Sham TIP 12291 3328 Sham 1.708567
## AAACCTGCAAGGACTG_Sham TIP 8931 2940 Sham 3.112753
## AAACCTGCAGCCACCA_Sham TIP 2426 1315 Sham 1.648805
## AAACCTGGTCAAAGAT_Sham TIP 4453 1281 Sham 2.335504
## AAACCTGTCACTCTTA_Sham TIP 4303 1776 Sham 2.974669
## nCount_SCT nFeature_SCT SCT_snn_res.0.5 seurat_clusters
## AAACCTGAGAACAATC_Sham 5561 2378 6 6
## AAACCTGAGGCTCATT_Sham 5893 2611 3 3
## AAACCTGCAAGGACTG_Sham 5833 2851 6 6
## AAACCTGCAGCCACCA_Sham 4221 1381 12 12
## AAACCTGGTCAAAGAT_Sham 4795 1281 8 8
## AAACCTGTCACTCTTA_Sham 4660 1776 0 0
## ECM1 annotation_1
## AAACCTGAGAACAATC_Sham 0.31091072 Düsseldorf
## AAACCTGAGGCTCATT_Sham 0.01829678 Köln
## AAACCTGCAAGGACTG_Sham 0.37171388 Düsseldorf
## AAACCTGCAGCCACCA_Sham 0.04132743 Hannover
## AAACCTGGTCAAAGAT_Sham -0.04088970 Essen
## AAACCTGTCACTCTTA_Sham 0.25029153 Berlin
Export the un-annotated clusters as pdf
Idents(TIP) <- "seurat_clusters" #switch back to the numbered clusters
pdf(file = "results/TIP_DimPlot.pdf", useDingbats = F, height = 5, width = 6)
DimPlot(TIP, label = T)
dev.off()
## quartz_off_screen
## 2
Export the annotated clusters as a tiff
Idents(TIP) <- "annotation_1" #switch to our annotation
tiff(filename = "results/TIP_DimPlot_annotation_1.tif",
height = 1500,
width = 1900,
units = "px",
res = 300)
DimPlot(TIP, label = T, group.by = "annotation_1", repel = T)
dev.off()
## quartz_off_screen
## 2
save(TIP, file = "results/TIP.Rdata")